Introduction¶

This document (a Jupyter Notebook) performs a series of analyses and computations on the Iris dataset, housed by the University of California, Irvine in their machine learning repository [1]. The goal of this document is to use the Iris dataset to predict the species of each Iris flower contained within it - Setosa, Versicolour, or Virginica - based on the characteristics of the flower. The characteristics used to determine the species are sepal length, sepal width, petal length, and petal width - all of which are recorded in centimetres.

Therefore, a set of steps will be taken, alongside proof of the calculations and code used to achieve this. Firstly, the dataset will be retrieved from the above-mentioned repository and cleaned. Secondly, Exploratory Data Analysis (EDA) will be executed on the dataset to extract any meaningful insights available. Thirdly, an Agglomerative Hierarchical Clustering algorithm will be applied - an approach that builds a hierarchy of clusters by initially treating each data point as an individual cluster, then merging the closest pairs while travelling up the hierarchy [2].

1 - Import relevant libraries and retrieve Iris dataset¶

The first step that needs to be carried out is to import all the relevant Python packages needed for the exercises within this document, as well as to retrieve the Iris dataset to perform these actions upon.

In [ ]:
#==========================================================
# Step 0 - Import relevant python libraries/packages
#==========================================================
import numpy            as np                                                           # Used for mathematical and statistical operations with arrays and matrices.
import pandas           as pd                                                           # Used for data analysis and manipulation.
import plotly.express   as px                                                           # Used for interactive data visualisation.
import plotly.figure_factory as ff                                                      # Used to create dendrogram figures.
import plotly.io        as pio                                                          # Used to assist rendering of plots in Visual Studio Code.
import scipy            as sci                                                          # Provides algorithms for scientific computing and statistical modelling.

from scipy.spatial      import distance_matrix                                          # Used to perform distance matrix calculations.
from scipy.spatial      import distance                                                 # Used for pairwise distance functions such as squareform.
from scipy.cluster      import hierarchy                                                # Ensures scipy.cluster is loaded for the linkage calls later on.

pio.renderers.default = "notebook"                                                      # Configure rendering of plots to set up for a jupyter notebook.
#==========================================================
#/////////////////////////////////////////////////////////
#==========================================================
In [ ]:
#==========================================================
# Step 1 -  Retrieve Iris dataset
#==========================================================
iris_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'   # Url to retrieve the Iris dataset.

iris_raw = pd.read_csv(                                                                 # Parse Iris data retrieval into a Pandas DataFrame.
                        filepath_or_buffer = iris_url
                        ,header = None
                      )                                   

iris_raw.head(n = 10)                                                                   # Display a glimpse of raw iris data post retrieval.
#==========================================================
#/////////////////////////////////////////////////////////
#==========================================================
Out[ ]:
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa

Now that the libraries have been imported and the data has been retrieved ready for use, additional work is required on data cleaning and formatting, to ensure the data is in an appropriate state before further steps such as EDA or Agglomerative Clustering can be performed to establish which record belongs to which species of flower.

2 - Data Cleaning¶

The following data cleaning tasks will be carried out on the raw Iris dataset (among others): checking for and removing missing information such as NAs/NULLs, applying column names, and optionally removing the species category (given what this document is aiming to achieve).
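One way to make the missing-value check explicit is `isna().sum()`, sketched here on a tiny hypothetical frame mirroring the first raw rows shown above (on the real data, the same call would run against `iris_raw`):

```python
import pandas as pd

# Illustrative stand-in mirroring the first rows of iris_raw (columns 0-4);
# the real check would run on the full iris_raw frame.
sample = pd.DataFrame({
    0: [5.1, 4.9, 4.7],
    1: [3.5, 3.0, 3.2],
    2: [1.4, 1.4, 1.3],
    3: [0.2, 0.2, 0.2],
    4: ['Iris-setosa', 'Iris-setosa', 'Iris-setosa'],
})

missing_per_column = sample.isna().sum()      # Count of missing values in each column.
print('Total missing values:', missing_per_column.sum())
```

If every count comes back zero, the `dropna` step in the cell below simply leaves the data untouched.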

In [ ]:
#==========================================================
# Step 2 - Data Cleaning on raw Iris data 
#==========================================================
iris_data = iris_raw.copy()                                                             # Create a copy of the raw data to house the soon-to-be cleaned dataset (copy() avoids mutating iris_raw).

iris_data.columns = [                                                                   # Add column names to the dataset.
                        'sepal_length'
                        ,'sepal_width'
                        ,'petal_length'
                        ,'petal_width'
                        ,'species'
                    ]   

iris_data = iris_data.dropna(axis = 0)                                                  # Remove any records that contain missing values (dropna returns a new frame, so reassign).

iris_data['species'] = iris_data['species'].str.replace('Iris-','')                     # Remove the 'Iris-' from the species column, to make things more concise.

#iris_data = iris_data.drop('species', axis = 1)                                        # Remove the species columns from dataframe, as not required going forward.

iris_data.head(n=10)                                                                    # See a glimpse of the dataset, once cleaning changes have been made.
#==========================================================
#/////////////////////////////////////////////////////////
#==========================================================
Out[ ]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa

At this stage, the data cleaning processes have been applied to the Iris dataset ready for the next stage; these included providing column names, removing any missing records where applicable, and trimming the 'Iris-' prefix from the species column. Now that data cleaning is done, meaningful statistics can be extracted from the dataset to understand the data in greater depth.
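A quick sanity check at this point is to count the records per species with `value_counts` (sketched on a toy stand-in frame; run against the real `iris_data` it should report 50 records for each of the three species):

```python
import pandas as pd

# Toy stand-in for the cleaned iris_data frame; only the species column matters here.
toy = pd.DataFrame({'species': ['setosa', 'setosa', 'versicolor',
                                'versicolor', 'virginica', 'virginica']})

counts = toy['species'].value_counts()        # Number of records per species.
print(counts)
```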

3 - Extracting Information¶

The Python code below gathers a set of statistics to understand the Iris dataset in more detail. These tasks will gather descriptive statistics such as the mean, minimum, and maximum values, amongst others.

3.1 - Summary statistics¶

In [ ]:
#==========================================================
# Step 3 - Extracting Information from Iris dataset
#==========================================================
# 3.1 -  Summary statistics
#----------------------------------------------------------
iris_data.describe()                                                                    # Provide summary statistics, including count, mean, standard deviation, minimum, and maximum values for each attribute.
Out[ ]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Based on the initial summary statistics across the dataset, a total of 150 records have been found. Additionally, the average (mean) values across the attributes differ greatly: the smallest is petal width at 1.198667cm - the petal being the part of the flower that is conspicuously coloured [3] - and the largest is sepal length at 5.843333cm - the sepals being the outer parts of a flower that enclose a developing bud [3-4]. In terms of minimum values, petal width again holds the smallest at 0.1cm, with sepal length the largest at 4.3cm. Lastly, for maximum values, the smallest is petal width at 2.5cm and the largest is sepal length at 7.9cm. Now that the statistics have been examined at a high level, the next step is to repeat the summary statistics per species and attribute.

In [ ]:
iris_data.groupby('species').describe()                                                  # Provide summary statistics per species, for all attributes together.
Out[ ]:
sepal_length sepal_width ... petal_length petal_width
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
species
setosa 50.0 5.006 0.352490 4.3 4.800 5.0 5.2 5.8 50.0 3.418 ... 1.575 1.9 50.0 0.244 0.107210 0.1 0.2 0.2 0.3 0.6
versicolor 50.0 5.936 0.516171 4.9 5.600 5.9 6.3 7.0 50.0 2.770 ... 4.600 5.1 50.0 1.326 0.197753 1.0 1.2 1.3 1.5 1.8
virginica 50.0 6.588 0.635880 4.9 6.225 6.5 6.9 7.9 50.0 2.974 ... 5.875 6.9 50.0 2.026 0.274650 1.4 1.8 2.0 2.3 2.5

3 rows × 32 columns

In [ ]:
iris_data.groupby('species')['sepal_length'].describe()                                 # Provide summary statistics per species, for sepal length.
Out[ ]:
count mean std min 25% 50% 75% max
species
setosa 50.0 5.006 0.352490 4.3 4.800 5.0 5.2 5.8
versicolor 50.0 5.936 0.516171 4.9 5.600 5.9 6.3 7.0
virginica 50.0 6.588 0.635880 4.9 6.225 6.5 6.9 7.9

Based on sepal length, setosa has the smallest average value, with virginica the largest. The same pattern also appears across the minimum and maximum values. A suggestion could therefore be made from the sepal length figures that setosa is the shortest of the Iris species, with virginica being the longest.

In [ ]:
iris_data.groupby('species')['sepal_width'].describe()                                 # Provide summary statistics per species, for sepal width.
Out[ ]:
count mean std min 25% 50% 75% max
species
setosa 50.0 3.418 0.381024 2.3 3.125 3.4 3.675 4.4
versicolor 50.0 2.770 0.313798 2.0 2.525 2.8 3.000 3.4
virginica 50.0 2.974 0.322497 2.2 2.800 3.0 3.175 3.8

Based on sepal width, versicolor has the smallest average value, with setosa the largest. The same pattern also appears across the minimum and maximum values. A suggestion could therefore be made from the sepal width figures that versicolor is the thinnest of the Iris species, with setosa being the widest.

In [ ]:
iris_data.groupby('species')['petal_length'].describe()                                 # Provide summary statistics per species, for petal length.
Out[ ]:
count mean std min 25% 50% 75% max
species
setosa 50.0 1.464 0.173511 1.0 1.4 1.50 1.575 1.9
versicolor 50.0 4.260 0.469911 3.0 4.0 4.35 4.600 5.1
virginica 50.0 5.552 0.551895 4.5 5.1 5.55 5.875 6.9

Based on petal length, setosa has the smallest mean value at 1.464cm, with virginica the largest at 5.552cm. The same pattern also appears across the minimum and maximum values. A suggestion could therefore be made from the petal length figures that the setosa species has the shortest petals, with virginica having the longest.

In [ ]:
iris_data.groupby('species')['petal_width'].describe()                                 # Provide summary statistics per species, for petal width.
Out[ ]:
count mean std min 25% 50% 75% max
species
setosa 50.0 0.244 0.107210 0.1 0.2 0.2 0.3 0.6
versicolor 50.0 1.326 0.197753 1.0 1.2 1.3 1.5 1.8
virginica 50.0 2.026 0.274650 1.4 1.8 2.0 2.3 2.5

Based on petal width, setosa has the smallest mean value at 0.244cm, with virginica the largest at 2.026cm. The same pattern also appears across the minimum and maximum values. A suggestion could therefore be made from the petal width figures that the setosa species has the thinnest petals, with virginica having the widest.

To summarise, discoveries have been made about the records within the Iris dataset. These suggest that the setosa species appears to be the smallest Iris flower and virginica the largest variety, based on the summary statistics uncovered. To examine this data further, the next step shall involve a set of visualisations to understand the data in more detail and explore what else can be discovered.
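The 'setosa smallest, virginica largest' reading can be given a rough numerical footing by averaging the four per-species means reported in the tables above (a crude size score for illustration, with the values copied from those tables, not a formal measure):

```python
import pandas as pd

# Per-species attribute means taken from the summary tables above (in cm).
means = pd.DataFrame(
    {
        'sepal_length': [5.006, 5.936, 6.588],
        'sepal_width':  [3.418, 2.770, 2.974],
        'petal_length': [1.464, 4.260, 5.552],
        'petal_width':  [0.244, 1.326, 2.026],
    },
    index=['setosa', 'versicolor', 'virginica'],
)

# Average the four means per species as a crude overall "size" score.
size_score = means.mean(axis=1)
print(size_score.sort_values())
```

The scores rank setosa lowest and virginica highest, in line with the per-attribute observations above.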

3.2 - Visualisations¶

In [ ]:
# 3.2 -  Visualisations
#----------------------------------------------------------
# Pairplot of the Iris dataset attributes
p_plot = px.scatter_matrix(                                                           # Create a scatter plot matrix of all attributes in the Iris dataset.                  
                            data_frame  = iris_data                                   # Declaring dataframe to use as iris_data.
                            ,dimensions = [                                           # Declaring the attributes in which to measure.
                                            'sepal_width'
                                            ,'sepal_length'
                                            ,'petal_width'
                                            ,'petal_length'
                                          ]
                            ,color      = 'species'                                   # Categorising the points based on the species of Iris flower.
                            ,symbol     = 'species'                                   # Providing each species of Iris flower a different symbol to aid in differentiation. 
                            ,title      = 'Pairplot matrix of the Iris dataset'       # Showing a title for plot to be created.
                            ,labels     = ({                                          # Remove underscores (_) from axis labels.
                                            col:col.replace('_', ' ')
                                            for col in iris_data.columns
                                          })
                              
                            )
p_plot.update_traces(diagonal_visible   = False)                                      # Remove plots that measure the same attribute on both axes (e.g. sepal width vs sepal width).

p_plot.show()

Taking some inspiration from Plotly [5], the above pairplot has been created for the attributes found in the Iris dataset. Plots have been removed where the attributes on opposite sides of the axes are the same, such as sepal width being on both the x and y axes; as these only show a linear relationship of the same data, they were not deemed fruitful for the analysis. From the pairplot, there appears to be a set of distinct clusters forming; particularly, the setosa species shows separately from the other two species - especially in the plots concerning the attributes of petal length and petal width against the others. Therefore, the next plot shall focus on comparing the petal widths against the petal lengths from the dataset.

In [ ]:
# Scatterplot of petal widths against petal lengths
#----------------------------------------------------------
s_plot = px.scatter(
                        data_frame = iris_data
                        ,x = 'petal_width'
                        ,y = 'petal_length'
                        ,color      = 'species'                                     # Categorising the points based on the species of Iris flower.
                        ,symbol     = 'species'                                     # Providing each species of Iris flower a different symbol to aid in differentiation.
                        ,title      = (                                 
                                       '''Petal Width vs '''  +                     
                                       '''Petal Length in Iris dataset'''
                                      ) 
                        ,labels     = ({                                            # Remove underscores (_) from axis labels.
                                          col:col.replace('_', ' ')
                                          for col in iris_data.columns
                                       })
                   )

s_plot.show()
#==========================================================
#/////////////////////////////////////////////////////////
#==========================================================

Using inspiration from Plotly once again [5], the above plot measures the petal widths against the petal lengths found in the Iris dataset. What appears apparent is a distinct cluster of the setosa species, separate from the other types of Iris flower. Furthermore, whilst virginica and versicolor are in close proximity to one another - with minimal overlap - it is clear that these can be declared individual clusters in their own right.

To summarise, an argument can be made that 3 clusters can be found within the Iris dataset. These clusters loosely relate to the species to which each Iris flower belongs; for instance, there is a clearer distinction between the setosa species and the other types of Iris. Conversely, the two species of virginica and versicolor are more closely aligned; though, due to the minimal overlap in data points, a distinction can still be drawn between them.

4 - Plotting single linkage dendrograms¶

Now that EDA has been performed on the Iris dataset, and the potential clusters that could form have been seen, the next stage in this document attempts to visualise a single linkage dendrogram on a subset of the dataset. To perform such a task, a series of steps will be conducted. Firstly, take the top six records found in the dataset, based on both sepal widths and lengths. Secondly, ensure the data is converted into a matrix. Thirdly, adopt the agglomerative hierarchical clustering algorithm. Fourthly, create a scatterplot to show the points at their initial starting positions. Finally, run the algorithm to show how the clusters have been formed, via a dendrogram visualisation and a Euclidean distance matrix. The merging process iterates until a single cluster remains [6].
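The merging described above can be sketched as a minimal single-linkage loop in plain NumPy, using the six sepal points that section 4.1 below extracts (an illustrative re-implementation of the idea, not the SciPy routines used later):

```python
import numpy as np

# The six (sepal_width, sepal_length) points used in section 4.1.
points = np.array([[3.5, 5.1], [3.0, 4.9], [3.2, 4.7],
                   [3.1, 4.6], [3.6, 5.0], [3.9, 5.4]])

# Start with every point as its own cluster.
clusters = [[i] for i in range(len(points))]
merges = []

def single_link(a, b):
    """Smallest pairwise distance between two clusters (single linkage)."""
    return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

# Repeatedly merge the two closest clusters until only one remains.
while len(clusters) > 1:
    pairs = [(single_link(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    d, i, j = min(pairs)
    merges.append((clusters[i], clusters[j], round(d, 6)))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

for a, b, d in merges:
    print(a, '+', b, 'at distance', d)
```

Five merges reduce six singleton clusters to one, with the merge distance recorded at each step - exactly the information a dendrogram draws.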

4.1 - Retrieving a subset of the Iris data¶

In [ ]:
#==========================================================
# Step 4 - Plotting single linkage dendograms
#==========================================================
# 4.1 -  Retrieving subset of Iris data
#----------------------------------------------------------
sub_data                = iris_data[['sepal_width','sepal_length']].head(n = 6)       # Create new dataframe from iris_data, take only sepal_width and sepal_length columns, and take only the top 6 records.

sub_data[['sepal_width','sepal_length']] = sub_data[['sepal_width','sepal_length']].apply(pd.to_numeric)

point_names             = ['p1','p2','p3','p4','p5','p6']                           # Create a list of names for the points when they appear as coordinates on a scatterplot.

sub_data['points']      = point_names                                               # Add the names as a column onto the dataframe.


dendo_data = pd.DataFrame(sub_data).set_index('points')
dendo_data
Out[ ]:
sepal_width sepal_length
points
p1 3.5 5.1
p2 3.0 4.9
p3 3.2 4.7
p4 3.1 4.6
p5 3.6 5.0
p6 3.9 5.4

4.2 - Initial scatterplot on data subset¶

For the below scatterplot and further steps, the sepal width attribute is assigned to the x axis, whereas the sepal length is assigned to the y axis.

In [ ]:
# 4.2 - scatterplot of initial subset
#----------------------------------------------------------
i_plot = px.scatter(
                        data_frame  = dendo_data
                        ,x          = 'sepal_width'
                        ,y          = 'sepal_length'
                        ,text       = dendo_data.index                              # Add point names as data labels onto plot
                        ,title      = ('Initial scatterplot of subset') 
                        ,labels     = ({                                            # Remove underscores (_) from axis labels.
                                          col:col.replace('_', ' ')
                                          for col in iris_data.columns
                                       })
                   )

i_plot.update_traces(textposition = 'top center')

i_plot.show()       

Now that the initial scatterplot of all the points to apply the algorithm against has been created, the next phase of implementing it can begin: first, the Euclidean distances between these points will be calculated. At a high level, an agglomerative hierarchical clustering algorithm then iterates over this matrix - merging the closest pair of clusters, recomputing the distances, and repeating.

4.3 - Euclidean Distance Matrix¶

The Euclidean distance matrix shall be created using the SciPy package, as inspired by a post on Stack Overflow [7] and confirmed by the SciPy documentation [8].

In [ ]:
# 4.3 - Euclidean distance matrix
#----------------------------------------------------------
# scipy Euclidean distance matrix code 
i_dist_matrix = pd.DataFrame(
                              distance_matrix(
                                                x  = dendo_data[[
                                                                  'sepal_width'
                                                            ,     'sepal_length'
                                                               ]]
                                                ,y = dendo_data[[
                                                                  'sepal_width'
                                                                  ,'sepal_length'
                                                               ]]
                                                ,p = 2                              # 2 represents l2 norm - Euclidean distance                    
                                             )
                              ,index = dendo_data.index
                              ,columns = dendo_data.index
                            )

i_dist_matrix
Out[ ]:
points p1 p2 p3 p4 p5 p6
points
p1 0.000000 0.538516 0.500000 0.640312 0.141421 0.500000
p2 0.538516 0.000000 0.282843 0.316228 0.608276 1.029563
p3 0.500000 0.282843 0.000000 0.141421 0.500000 0.989949
p4 0.640312 0.316228 0.141421 0.000000 0.640312 1.131371
p5 0.141421 0.608276 0.500000 0.640312 0.000000 0.500000
p6 0.500000 1.029563 0.989949 1.131371 0.500000 0.000000
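As a hand check of one entry above, the p1 to p5 distance can be recomputed directly from the coordinate definition of Euclidean distance:

```python
import math

# p1 and p5 as (sepal_width, sepal_length) pairs from the subset.
p1 = (3.5, 5.1)
p5 = (3.6, 5.0)

# Square root of the summed squared coordinate differences.
d = math.sqrt((p1[0] - p5[0]) ** 2 + (p1[1] - p5[1]) ** 2)
print(round(d, 6))  # Matches the 0.141421 entry in the matrix.
```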

Now that the Euclidean distance matrix has been created, the next step is to determine which points are closest to each other, based on the minimum distance between them. For this to occur, the zeros need to be replaced with 9999 so the code can find the lowest valid number; the zeros currently in the matrix only arise where a point is measured against itself (e.g. p1 - p1).
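An alternative to the 9999 sentinel used in the cell below is to mask the diagonal with `np.inf`, which can never win a minimum and needs no converting back (a hypothetical variant, sketched on a small illustrative matrix rather than the notebook's own):

```python
import numpy as np
import pandas as pd

# Small symmetric distance matrix with zeros on the diagonal (illustrative values).
m = pd.DataFrame([[0.00, 0.54, 0.14],
                  [0.54, 0.00, 0.61],
                  [0.14, 0.61, 0.00]],
                 index=['p1', 'p2', 'p5'], columns=['p1', 'p2', 'p5'])

masked = m.replace(0, np.inf)      # Infinity on the diagonal is ignored by min().
print(masked.min(axis=1))          # Closest distance for each point.
print(masked.idxmin(axis=1))       # Name of each point's nearest neighbour.
```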

In [ ]:
# Determine the closest points, based on the minimum distance traveled between points.
#----------------------------------------------------------

e_dist_matrix               = i_dist_matrix.copy()                                  # Create new instance of distance matrix before making alterations.

e_dist_matrix.replace(0,9999, inplace = True)                                       # Replace zeros with 9999 so that when looking for minimum values, these values will not disturb.

min_value                   = e_dist_matrix.min(axis=1)                             # Retrieve the minimum value found in each row of the distance matrix.

e_dist_matrix['min_value']  = min_value                                             # Append the minimum values as an additional column onto the distance matrix.

min_label                   = e_dist_matrix.idxmin(axis=1)                          # Determine the point name that contains the minimum value.

e_dist_matrix['min_label']  = min_label                                             # Append the minimum value labels as an additional column onto the distance matrix.

e_dist_matrix.replace(9999,0, inplace = True)                                       # Convert 9999 back to 0

e_dist_matrix                                                                       # Preview the updated distance matrix.
Out[ ]:
points p1 p2 p3 p4 p5 p6 min_value min_label
points
p1 0.000000 0.538516 0.500000 0.640312 0.141421 0.500000 0.141421 p5
p2 0.538516 0.000000 0.282843 0.316228 0.608276 1.029563 0.282843 p3
p3 0.500000 0.282843 0.000000 0.141421 0.500000 0.989949 0.141421 p4
p4 0.640312 0.316228 0.141421 0.000000 0.640312 1.131371 0.141421 p3
p5 0.141421 0.608276 0.500000 0.640312 0.000000 0.500000 0.141421 p1
p6 0.500000 1.029563 0.989949 1.131371 0.500000 0.000000 0.500000 p5

Now that the Euclidean distance matrix has been created, the minimum distance between points found, and this information appended onto the matrix for clarity, the next step is to create a dendrogram to show the output via visualisation [9].

4.4 - Dendrogram creation¶

In [ ]:
# 4.4 - Dendrogram creation
#----------------------------------------------------------             
dist_matrix    = i_dist_matrix.to_numpy()                                                     # Convert distance matrix into numpy array.

dist_s         = sci.spatial.distance.squareform(dist_matrix)                                 # Convert the square distance matrix into condensed form for linkage.

linkage_matrix = sci.cluster.hierarchy.linkage(dist_s, method = 'single')                     # Create single linkage matrix based on the smallest distance between points.

#dendogram      = sci.cluster.hierarchy.dendrogram(Z = linkage_matrix, labels = ['p1','p2','p3','p4','p5','p6'])

fig = ff.create_dendrogram(
                            dendo_data                                                        # Pass the points themselves; distances are computed internally.
                            ,labels     = ['p1','p2','p3','p4','p5','p6']
                            ,linkagefun = lambda d: sci.cluster.hierarchy.linkage(d, method = 'single')  # create_dendrogram defaults to complete linkage, so force single linkage here.
                          )

fig['layout'].update({'width': 800
                      ,'height':800
                      ,'title': 'Dendrogram based on Euclidean distance matrix'
                      ,'xaxis':{'title':'Point names'}
                      ,'yaxis':{'title': 'Euclidean Distance'}
                    })

fig
#==========================================================
#/////////////////////////////////////////////////////////
#==========================================================

5 - How to interpret the dendrogram¶

Based on the output of the dendrogram provided above, there appear to be two clusters - a potential indication that two species of Iris flower have been found within the data subset. For instance, the points 2, 3, and 4 are aligned together, along with points 6, 1, and 5. However, where points 1 and 5 are similar in height, point 6 has quite a height difference. Therefore, an argument could be made that point 6 could be a cluster in its own right.
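The reading above can be checked against the linkage matrix itself, whose rows record each merge and its height (a sketch recomputing single linkage on the six points from section 4.1, assuming the same Euclidean/single-linkage settings):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# The six (sepal_width, sepal_length) points from section 4.1.
points = np.array([[3.5, 5.1], [3.0, 4.9], [3.2, 4.7],
                   [3.1, 4.6], [3.6, 5.0], [3.9, 5.4]])

# Each row of Z: [cluster_a, cluster_b, merge_height, new_cluster_size].
Z = linkage(pdist(points), method='single')
print(np.round(Z, 4))
```

The merge heights come out as roughly 0.1414, 0.1414, 0.2828, 0.5 and 0.5: two tight merges (p1 with p5, and p3 with p4), p2 joining at 0.2828, and point 6 only joining at 0.5 - consistent with the height difference noted for point 6 above.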

References¶

[1] University of California at Irvine, “Iris Dataset Machine Learning Repository,” archive.ics.uci.edu. https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ (accessed Jul. 12, 2022).

[2] K. Sasirekha and P. Baby, “Agglomerative Hierarchical Clustering Algorithm- A Review,” International Journal of Scientific and Research Publications, vol. 3, no. 3, pp. 1–3, Mar. 2013, Accessed: Jul. 12, 2022. [Online]. Available: https://www.ijsrp.org/research-paper-0313.php?rp=P15831

[3] American Museum of Natural History, “Parts of a Flower: An Illustrated Guide | AMNH,” American Museum of Natural History, 2020. https://www.amnh.org/learn-teach/curriculum-collections/biodiversity-counts/plant-identification/plant-morphology/parts-of-a-flower#:~:text=Sepal%3A%20The%20outer%20parts%20of (accessed Jul. 15, 2022).

[4] B. D. Editors, “Sepal,” Biology Dictionary, Jan. 05, 2017. https://biologydictionary.net/sepal/ (accessed Jul. 15, 2022).

[5] Plotly, “Scatterplot Matrix in Python,” plotly.com, 2022. https://plotly.com/python/splom/ (accessed Jul. 15, 2022).

[6] Chaitanya Reddy, “Understanding the concept of Hierarchical clustering Technique,” Medium, Dec. 10, 2018. https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec (accessed Jul. 15, 2022).

[7] Stackoverflow, “python - Creating a Distance Matrix?,” Stack Overflow. https://stackoverflow.com/questions/29481485/creating-a-distance-matrix (accessed Jul. 15, 2022).

[8] Scipy, “scipy.spatial.distance_matrix — SciPy v1.8.1 Manual,” docs.scipy.org, 2022. https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html (accessed Jul. 15, 2022).

[9] G. Chandrasekaran, “Manual Step by Step Single Link hierarchical clustering with dendogram.,” Analytics Vidhya, Jan. 06, 2020. https://medium.com/@gchandra/manual-step-by-step-single-link-hierarchical-clustering-with-dendogram-68cf1bbd737f (accessed Jul. 16, 2022).